Notebook to filter, clean, and join raw datafiles

Output clean data to the appropriate folder

Add links to sections in the notebook

ERCOT wind & load data
- Export clean data



In [1]:

    
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import os









    



/Users/Home/anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

ERCOT wind and load data

Find filenames for all .xls and .xlsx files in the wind generation subfolder



In [7]:

    
path = 'ERCOT/Hourly wind generation'
full_xls = os.path.join(path, '*.xls')
full_xlsx = os.path.join(path, '*.xlsx')

files = glob.glob(full_xls)
files.extend(glob.glob(full_xlsx))



In [17]:

    
cols = ['ERCOT Load, MW', 'Total Wind Installed, MW',
       'Total Wind Output, MW', 'Wind Output, % of Installed',
       'Wind Output, % of Load', '1-hr MW change', '1-hr % change']



In [22]:

    
df = pd.concat([pd.read_excel(fn, sn='numbers', index_col=0) for fn in files])



In [23]:

    
df.sort_index(inplace=True)



In [24]:

    
df.head()









    Out[24]:






  
    
      
      % of Daily Wind Change
      % of Daily Wind Change on Hour
      1-hr % change
      1-hr MW change
      Date
      ERCOT Load, MW
      Hour
      Month
      Total Wind Installed, MW
      Total Wind Output, MW
      Wind Daily Change on Hour
      Wind Output, % of Installed
      Wind Output, % of Load
    
  
  
    
      2007-01-01 00:00:00
      NaN
      NaN
      NaN
      NaN
      1
      30428.0
      0.0
      1.0
      2790.0
      1074.0
      NaN
      38.494624
      3.529644
    
    
      2007-01-01 01:00:00
      NaN
      NaN
      -14.096834
      -151.4
      1
      30133.0
      1.0
      1.0
      2790.0
      922.6
      NaN
      33.068100
      3.061760
    
    
      2007-01-01 02:00:00
      NaN
      NaN
      -7.955777
      -73.4
      1
      29941.0
      2.0
      1.0
      2790.0
      849.2
      NaN
      30.437276
      2.836245
    
    
      2007-01-01 03:00:00
      NaN
      NaN
      24.387659
      207.1
      1
      29949.0
      3.0
      1.0
      2790.0
      1056.3
      NaN
      37.860215
      3.526996
    
    
      2007-01-01 04:00:00
      NaN
      NaN
      -20.751680
      -219.2
      1
      30248.0
      4.0
      1.0
      2790.0
      837.1
      NaN
      30.003584
      2.767456



In [25]:

    
df = df.loc[:,cols]



In [26]:

    
df.head()









    Out[26]:






  
    
      
      ERCOT Load, MW
      Total Wind Installed, MW
      Total Wind Output, MW
      Wind Output, % of Installed
      Wind Output, % of Load
      1-hr MW change
      1-hr % change
    
  
  
    
      2007-01-01 00:00:00
      30428.0
      2790.0
      1074.0
      38.494624
      3.529644
      NaN
      NaN
    
    
      2007-01-01 01:00:00
      30133.0
      2790.0
      922.6
      33.068100
      3.061760
      -151.4
      -14.096834
    
    
      2007-01-01 02:00:00
      29941.0
      2790.0
      849.2
      30.437276
      2.836245
      -73.4
      -7.955777
    
    
      2007-01-01 03:00:00
      29949.0
      2790.0
      1056.3
      37.860215
      3.526996
      207.1
      24.387659
    
    
      2007-01-01 04:00:00
      30248.0
      2790.0
      837.1
      30.003584
      2.767456
      -219.2
      -20.751680



In [27]:

    
df.plot(y='Total Wind Installed, MW', use_index=True)









    Out[27]:





<matplotlib.axes._subplots.AxesSubplot at 0x11661c810>

Not sure why, but there is one bad row at the end. Removing it in the next cell.



In [28]:

    
df = df.iloc[:-1,:]



In [29]:

    
df.plot(y='Total Wind Installed, MW', use_index=True)









    Out[29]:





<matplotlib.axes._subplots.AxesSubplot at 0x11647d510>



In [34]:

    
df.plot(y='ERCOT Load, MW', use_index=True)









    Out[34]:





<matplotlib.axes._subplots.AxesSubplot at 0x11c5a7f10>

Looks like there is one datapoint from 2016 at the end of the file. I'm going to remove it.



In [39]:

    
sns.pointplot(df.index.year, df['ERCOT Load, MW'], aspect=1.5)
plt.title('ERCOT demand by year')









    Out[39]:





<matplotlib.text.Text at 0x11d5474d0>



In [41]:

    
df = df.iloc[:-1,:]



In [42]:

    
sns.pointplot(df.index.year, df['ERCOT Load, MW'], aspect=1.5)
plt.title('ERCOT demand by year')









    Out[42]:





<matplotlib.text.Text at 0x11d0fce90>



In [40]:

    
sns.pointplot(df.index.month, df['ERCOT Load, MW'], aspect=1.5)
plt.title('ERCOT demand by month')









    Out[40]:





<matplotlib.text.Text at 0x118c82c90>

Export clean ERCOT data



In [43]:

    
filename = 'ERCOT wind data.csv'
path = '../Clean Data'
fullpath = os.path.join(path, filename)



In [45]:

    
df.to_csv(fullpath)

Notebook to filter, clean, and join raw datafiles

Contents

ERCOT wind and load data

Export clean ERCOT data

EIA 923 data

	% of Daily Wind Change	% of Daily Wind Change on Hour	1-hr % change	1-hr MW change	Date	ERCOT Load, MW	Hour	Month	Total Wind Installed, MW	Total Wind Output, MW	Wind Daily Change on Hour	Wind Output, % of Installed	Wind Output, % of Load
2007-01-01 00:00:00	NaN	NaN	NaN	NaN	1	30428.0	0.0	1.0	2790.0	1074.0	NaN	38.494624	3.529644
2007-01-01 01:00:00	NaN	NaN	-14.096834	-151.4	1	30133.0	1.0	1.0	2790.0	922.6	NaN	33.068100	3.061760
2007-01-01 02:00:00	NaN	NaN	-7.955777	-73.4	1	29941.0	2.0	1.0	2790.0	849.2	NaN	30.437276	2.836245
2007-01-01 03:00:00	NaN	NaN	24.387659	207.1	1	29949.0	3.0	1.0	2790.0	1056.3	NaN	37.860215	3.526996
2007-01-01 04:00:00	NaN	NaN	-20.751680	-219.2	1	30248.0	4.0	1.0	2790.0	837.1	NaN	30.003584	2.767456